Abstract


This paper analytically studies the performance of a synchronous conservative parallel
discrete-event simulation protocol. The class of simulation models considered is
oriented around a physical domain, and possesses a limited ability to predict future
behavior. Using a stochastic model we show that as the volume of simulation activity
in the model increases relative to a fixed architecture, the complexity of the average
per-event overhead due to synchronization, event list manipulation, lookahead
calculations, and processor idle time approaches the complexity of the average per-event
overhead of a serial simulation. The method is therefore within a constant factor of
optimal. Our analysis demonstrates that on large problems (those for which parallel
processing is ideally suited) there is often enough parallel workload so that processors
are not usually idle. We also demonstrate the viability of the method empirically,
showing how good performance is achieved on large problems using a thirty-two node
Intel iPSC/2 distributed memory multiprocessor.
1  Introduction


The  problem  of  parallelizing  discrete-event  simulations  has  received  a  great  deal  of  attention 
in  the  last  several  years.  Simulations  pose  unique  synchronization  constraints  due  to  their 
underlying  sense  of  time.  When  the  simulation  model  can  be  simultaneously  changed  by 
different  processors,  actions  by  one  processor  can  affect  actions  by  another.  One  must  not 
simulate any element of the model too far ahead of any other in simulation time, to avoid the
risk of having its logical past affected. Alternatively, one must be prepared to fix the logical
past of any element determined to have been simulated too far.

Two schools of thought have emerged concerning synchronization. The conservative
school [5], [13], [23], [24] employs methods which prevent any processor from simulating
beyond a point at which another processor might affect it. Those synchronization points
need to be re-established periodically to allow the simulation to progress. Early efforts
focused on finding protocols which were either free from deadlock, or which detected and
corrected deadlock [17]. The optimistic school [7] allows a processor to simulate as far forward
in time as it wants, without regard for the risk of having its simulation past affected. If its
past is changed (due to interaction with a processor farther behind in simulation time) it
must then be able to “roll back” in time at least that far, and must cancel any erroneous
actions it has taken in its false future.

Conservative  protocols  are  sometimes  faulted  for  leaving  processors  idle,  due  to  overly 
pessimistic synchronization assumptions. It is almost always true that individual model
elements are blocked because of pessimistic synchronization; the conclusion that processors
tend to be blocked requires the assumption that all model elements assigned to a processor
tend to be blocked simultaneously, or that each processor has only one model element. The
latter assumption pervades many performance studies, and is unrealistic for fine-grained
simulation models executed on coarser-grained multiprocessors. Intuition suggests that if
there  are  many  model  elements  assigned  to  each  processor,  then  it  is  unlikely  that  all  model 
elements  on  a  processor  will  be  blocked.  Given  sufficient  workload,  a  properly  designed 
conservative  method  should  not  leave  processors  idle,  because  there  is  so  much  work  to  do. 
While  some  model  elements  are  blocked  due  to  synchronization  concerns,  other  elements, 
with  high  probability,  are  not. 

It is natural to ask how much performance degradation due to blocking a conservative
method suffers. We answer that question by analyzing a simple conservative synchronization
method. The method assumes the ability to pre-sample activity duration times [20], and
assumes that any queueing discipline used is non-preemptive. The protocol itself is quite
simple. As applied to a queueing network it works as follows. First, whenever a job enters
service, the queue to which the job will be routed is immediately notified of that arrival
(sometime in the future), and the receiving queue computes a service time for the new arrival.
These two actions constitute lookahead, a concept which is key to the protocol's success. Now
imagine that all events with time-stamps less than $t$ have already been processed and that
the processors are globally synchronized. For each queue we determine the time-stamp of the
next job it would route (excluding one in service) if no further arrivals occur at that queue.
The processors cooperatively compute the minimum such time, say $\delta(t)$. We will show that
all further messages to be sent in the simulation have time-stamps at least as large as $\delta(t)$.
Consequently a processor may evaluate, in parallel with all other processors, all of its events
with time-stamps less than $\delta(t)$. Having done so, the processors synchronize globally, and
repeat the process. The interval $[t, \delta(t))$ is called a window, and $\delta(t) - t$ is its width.

We  analyze  the  performance  of  the  protocol  by  first  deriving  an  approximated  lower  bound 
on  the  equilibrium  mean  window  width.  We  then  multiply  this  width  by  the  equilibrium  rate 
at  which  the  simulation  generates  events.  The  resulting  product  is  an  approximated  lower 
bound on the average number of events that are processed within a window. We then
identify conditions under which the average number of events processed in a window increases
without bound as the system simulation event generation rate increases. Next we analyze the
synchronization, idle time, lookahead calculation, and event-list overheads of the protocol as
a function of $T$, the number of events in the system at a time. The average overhead per
processed event is shown to be $O(f(T))$, where $f(T)$ is the complexity of the average per-event
overhead in an optimized serial simulation. Therefore the protocol's asymptotic performance
(as $T \to \infty$)
is  within  a  constant  factor  of  optimal.  Finally,  we  demonstrate  the  viability  of  the  protocol 
empirically.  A  parallel  simulation  system  based  on  the  protocol  has  been  implemented  on 
a  thirty-two  node  Intel  iPSC/2  distributed  memory  multiprocessor^].  Processor  efficiencies 
in  the  range  of  60%  -  90%  are  reported  for  several  different  large  simulation  models. 
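
The bound outlined in this paragraph is a simple product and can be written compactly. The
symbols are notation introduced here for illustration only, not the notation of the analysis
in §4: $\lambda$ is the equilibrium event generation rate, $W_{lb}$ is the approximated lower
bound on the equilibrium mean window width, and $\bar{N}$ is the average number of events
processed per window.

```latex
\bar{N} \;\gtrsim\; \lambda \, W_{lb}
```

Under this reading, the question of interest is whether the product $\lambda W_{lb}$ grows
without bound as $\lambda$ increases, even though the window width itself may shrink.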

It  is  important  to  remember  that  our  analysis  concerns  average  case  performance  based 
on  a  general  stochastic  model.  Specific  problem  examples  can  be  constructed  to  ensure  that 
the  protocol  essentially  executes  serially,  while  another  can  execute  many  things  in  parallel. 
We  believe  that  such  examples  are  somewhat  artificial  and  do  not  shed  a  great  deal  of  light 
on  how  performance  will  behave  over  a  wide  range  of  problems.  Our  intention  is  to  study 
the  average  case  performance  on  a  model  of  typical  simulation  problems. 

This  paper  makes  two  basic  contributions.  One  is  to  develop  a  new  approach  for  the 
analysis  of  parallel  discrete-event  simulations.  The  second  is  a  demonstration  that  many 
large  simulation  models  having  much  concurrent  activity  can  be  effectively  simulated  in 
parallel  using  a  simple  conservative  protocol. 

This  paper  is  organized  as  follows.  §2  gives  some  background  for  this  work.  §3  describes 
the  model  of  discrete-event  simulations  we  use  in  our  protocol  description  and  analysis,  and 
then  introduces  the  protocol.  §4  derives  an  approximated  lower  bound  on  the  average  number 
of  events  processed  in  a  window.  §5  determines  the  complexity  of  the  average  total  overhead 
per  event  suffered  by  using  the  protocol.  §6  reports  on  the  performance  of  the  protocol  on 
several  different  simulation  models.  §7  gives  our  conclusions. 
2  Background

Our protocol is similar to others recently proposed [1], [4], [13], [14], [28]. Unlike earlier
asynchronous protocols, these synchronously move a window across simulation time, roughly
as follows. Let floor be the lower edge of the window. This means that all events with
time-stamps less than floor have already been processed. The processors then cooperatively
determine the upper window edge, ceiling. This value is chosen in such a way that all events
within [floor, ceiling) can safely be processed in parallel. The value ceiling then becomes
floor for the next window, and so on.

The  major  question  with  synchronous  conservative  protocols  is  whether  windows  small 
enough  to  prevent  dependencies  between  window  events  admit  enough  such  events  to  keep 
all  the  processors  busy.  Lubachevsky  was  the  first  to  answer  this  question  [1  1],  by  deriving 
a  lower  bound  on  the  number  of  events  processed  within  a  window  defined  by  his  method. 
Using  this  bound  and  some  assumptions  concerning  event  density  (in  simulation  time),  he 
shows  that  the  performance  of  his  method  scales  up  as  the  problem  size  and  number  of 
processors are simultaneously increased. However, his results are not quantitative, although
they might have been so developed. Our analysis is different, in that we define a model
from which event densities follow naturally, and we quantify the average number of events
processed in a window. Ours is an average case analysis, while Lubachevsky's is a worst
case analysis. Also, Lubachevsky's analysis hinges on the assumption of a non-zero minimal
propagation delay, while ours does not. We do show that minimum service times can
dramatically improve the average number of events processed each window.

The protocol we study is an application of the one described by Chandy and Sherman [4]
to a more restricted problem domain. Like Lubachevsky's method, theirs requires periodic
global synchronization among processors. In each window their protocol computes the
minimum time-stamp among all “conditional” events, and then processes all “unconditional”
events with smaller time-stamps. In addition, their technique incorporates the conversion of
“conditional” events into “unconditional” events, as a function of messages exchanged in the
simulation. Such conversion is highly application dependent. The most important difference
between our protocol and the general conditional-event approach lies in the specificity of
our conversion of conditional events into unconditional events, in a way that requires little
model-specific information. Furthermore, our protocol is stated within the context of a
model closer to those used by simulation practitioners than is the model used to describe
the conditional-event approach.

Our analysis of lookahead is related to that developed by Lin and Lazowska in [10], and
by Wagner and Lazowska in [30]. Their work analyzes the ability of different queue types
to predict future behavior, and focuses on lookahead at a single queue. Our analysis is of a
much simpler lookahead scheme, but analyzed over the entire simulation. The protocol we
describe can be easily adapted to accommodate those more complex techniques for computing
lookahead. We have also analyzed a different class of simulations than the one studied
here, on massively parallel architectures [19]. The sensitivity of performance to lookahead is
quantified, upper bounds on optimal and optimistic performance are derived, as is a lower
bound on the performance of the same protocol we study in this paper.

Some analysis exists of the optimistic “Time Warp” method of synchronization. The
earliest analyses concerned detailed stochastic models of two-processor systems [9, 18]. These
models include overhead costs and permit heterogeneous processors. Most other studies of
Time Warp tend to assume negligible state-saving and rollback costs. For example, Lin and
Lazowska have shown that if Time Warp has no state-saving or rollback costs, and if
“correct” computations are never rolled back, then Time Warp achieves optimality [11]. This is
intuitive, because Time Warp aggressively searches for the simulation's critical path; if it is
able to do so without cost, its performance must be optimal. Other analyses highlight the
fact that Time Warp can “guess right” while conservative methods must block. Lipton and
Mizell have shown that there is a certain asymmetry between optimistic and conservative
methods: while it is possible for an optimistic method to arbitrarily outperform a
conservative method, the converse is not true [12]. Their analysis explicitly includes overhead costs.
Madisetti, Walrand, and Messerschmitt [16] have developed a performance model which
estimates the rate at which simulation time advances under an optimistic strategy such as
Time Warp. They model the behavior of the system as a Markov chain, and include the
cost of communication and of synchronization. Their analysis is exact for two processors,
and approximate for a general number of processors. Lubachevsky, Schwartz, and Weiss use
a sophisticated stochastic model to show how it is possible for Time Warp simulations to
thrash in periods of “cascading rollbacks” [15].

3  Model  and  Protocol 

We now describe our model of discrete-event simulations more formally, and define the
synchronization protocol.

3.1  Model  Assumptions 

Consider a domain containing $S$ sites, where activities occur. An activity (e.g., service
given to a job at a queue) begins, ends, and upon its completion enables (i.e., causes) other
activities. These causations are reported to the appropriate sites by way of completion
messages. Consequently, three distinct events are associated with each activity: enable,
begin, and complete. The enable event for a given activity can be different from the begin
event if the site imposes queueing. We permit a completion to cause more than one activity
in order to include simulation problems such as Petri nets, where a single transition firing
may cause token arrivals at multiple Petri-net places. Thus, we assume that a complete
event at site $S_i$ causes an activity at each member of a random subset of other sites. All
enable events caused by a completion have the same time-stamp as the completion. An
activity is said to be occurring at time $t$ if its associated begin event has a time-stamp no
greater than $t$, and its complete event has time-stamp no less than $t$. Each site maintains
its own priority queue of events associated with activities enqueued or occurring at the site.
Each site also maintains its own simulation clock, which records the time-stamp on the last
event processed.

Under the assumptions of our model the enable and complete events are unconditional:
once placed on the event list, no further activity in the simulation will change them. In
contrast, begin events may be conditional. For example, a begin event at time $t$ might describe
the future placement of a particular job into service at a queue at time $t$. If before $t$ another
job with higher priority arrives, that begin event may be removed from the event list.

Depending on the ability of the site, activities may occur there one at a time, or
concurrently. We assume that either an unbounded number of activities may simultaneously occur
at a site, or that only one activity may occur at a time. In the former case, we say the site
has infinite servers. In the latter case, enabled activities may be enqueued before occurring.
The delay in simulation time between when an activity begins and ends is called its
duration. We assume that a duration is strictly positive, but do not assume a minimal duration.
For the purposes of analysis we assume that the simulation model is ergodic, and that each
duration time comes from a distribution composed by adding a nonnegative constant to an
exponentially distributed random variable. Each site may have a unique distribution.
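
The duration assumption can be illustrated with a minimal Python sketch. The function name
`sample_duration` and the parameters `d_min` (the nonnegative constant) and `rate` (the rate
of the exponential part) are hypothetical names chosen for this illustration; they do not
come from the paper.

```python
import random

def sample_duration(d_min: float, rate: float, rng: random.Random) -> float:
    """Draw one duration: a nonnegative constant plus an exponential variate."""
    if d_min < 0 or rate <= 0:
        raise ValueError("d_min must be >= 0 and rate must be > 0")
    return d_min + rng.expovariate(rate)

# Illustrative parameters only (assumed, not from the paper).
rng = random.Random(0)
durations = [sample_duration(d_min=0.5, rate=2.0, rng=rng) for _ in range(10_000)]
mean_duration = sum(durations) / len(durations)  # expected value is d_min + 1/rate
```

Giving each site its own `d_min` and `rate` corresponds to letting each site have a unique
distribution.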

Our  performance  analysis  rests  on  a  number  of  assumptions  about  the  simulation  model 
which  are  exploited  by  the  protocol. 

1.  We assume that once an activity begins, the causation of further activities cannot affect
its completion time.

2.  We assume that the simulation state change due to an activity completion is very
local: the state change is implied by knowledge of which activity completed, which
activities are subsequently caused, and the time of the completion.

3.  We assume that the activities caused by the completion of activity $A_j$ can be reported
to their respective sites at the time that $A_j$ begins.

4.  We  assume  that  a  lower  bound  on  the  duration  of  an  activity  can  be  determined  at 
the  time  of  the  receipt  of  the  completion  message  which  causes  the  activity. 

To illustrate these assumptions, consider a job $J$ which at time $s$ begins service at a
non-preemptive queue $Q_1$, completes at time $s'$, and is routed to $Q_2$. Assumption 1 is satisfied
by the nature of $Q_1$'s queueing discipline. Assumption 2 is satisfied because the change in
model state due to this departure is completely characterized by knowledge of $Q_1$, $Q_2$, and
$s'$. Assumption 3 is satisfied if the service discipline at $Q_1$ is non-preemptive and the routing
is independent of the jobs enqueued at time $s'$: in the simulation we can report the arrival
of $J$ at time $s'$ to $Q_2$ concurrently with the entering of $J$ into service at $Q_1$, at time $s$. By
doing so, the processing required of $J$'s completion event at $s'$ does not include reporting
$J$'s departure, but may include the recording of statistics which depend on all simulation
activity at $Q_1$ (including arrivals) up to time $s'$. Assumption 4 is satisfied if $J$'s service time
at $Q_2$ can be computed at the time that $Q_1$ reports the arrival of $J$ to $Q_2$. This is possible
if the service time of every job at $Q_2$ is drawn independently from the same probability
distribution.

This  model  describes  a  large  number  of  common  simulation  models,  and  is  related  to 
event graphs described in [26] and [27]. Many queueing networks are obviously captured.
Logic  networks  are  described,  with  activities  corresponding  to  logical  module  evaluations. 
Here  new  activities  are  caused  when  a  module  output  changes  state.  The  movement  and 
interaction  of  objects  in  a  domain  can  also  be  captured.  One  assumes  no  queueing  at  sites, 
and  models  the  passage  of  an  object  across  some  discrete  region  of  the  domain  as  an  activity. 
Lookahead plays a major role in our synchronization method and its analysis. Lookahead
exists  and  is  exploited  by  assumptions  3  and  4  above. 

Simulation workload is the event processing. This includes changing anticipated event
times as a result of newly caused activities, changing simulation state variables, and
gathering/recording statistics. We view event list management costs as inescapable overheads
associated  with  the  processing  of  events. 

Our protocol does not require a minimal duration time for its correctness. However,
performance is substantially enhanced if every duration time is bounded from below by
$D_{min} > 0$. Equivalently, we can introduce a minimal delay $D_{min}$ between when an
activity completes and when the activities it causes are enabled. We will use $D_{min}$ throughout
our analysis, but may take it to be zero.

3.2  Protocol  Definition 

Next  we  define  the  synchronization  protocol  in  terms  of  the  model  given  in  §3.1.  Our  only 
architectural  assumptions  are  that  the  simulation  model  is  executed  on  a  multiprocessor 
having  P  processors;  any  processor  can  send  a  message  (indirectly,  if  needed)  to  any  other 
processor,  and  the  processors  can  synchronize  globally. 

One important aspect of our protocol is the “pre-sending” of completion messages. Let
$A_j$ be some activity whose begin event has time-stamp $s$. Let $s'$ be $A_j$'s completion time.
Under our protocol $A_j$'s site must send completion messages to all sites where activities
caused by $A_j$'s completion will occur, at the time $A_j$ begins. Observe that even though the
simulation time at $A_j$'s site is $s$, these completion messages are time-stamped with time
$s' > s$. A site which receives such a notification inserts an enable event with time-stamp $s'$
into its event list (a non-queueing site may directly insert a begin event with time $s'$); it also
selects a duration time (or a lower bound on it) for the newly caused activity.
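
Pre-sending can be sketched in a few lines of Python. This is only an illustration under
stated assumptions: the class `Site`, the function `begin_activity`, the exponential duration
choice, and the label `"J at Q2"` are hypothetical and are not part of the protocol's
definition.

```python
import heapq
import random
from dataclasses import dataclass, field

@dataclass
class Site:
    name: str
    mean_duration: float
    events: list = field(default_factory=list)     # heap of (time-stamp, kind, activity)
    durations: dict = field(default_factory=dict)  # pre-sampled durations by activity

    def receive_completion(self, stamp, activity, rng):
        """Handle a pre-sent completion message stamped with the sender's completion time."""
        heapq.heappush(self.events, (stamp, "enable", activity))
        # Select the duration of the newly caused activity right away (lookahead).
        self.durations[activity] = rng.expovariate(1.0 / self.mean_duration)

def begin_activity(begin_time, duration, caused, rng):
    """Process a begin event at time s: pre-send messages stamped s' = s + duration.

    caused -- list of (destination site, activity label) pairs
    """
    completion_time = begin_time + duration
    for site, activity in caused:
        site.receive_completion(completion_time, activity, rng)
    return completion_time

rng = random.Random(1)
q2 = Site("Q2", mean_duration=2.0)
completion = begin_activity(begin_time=10.0, duration=3.5, caused=[(q2, "J at Q2")], rng=rng)
```

Here the sender's clock is still at 10.0 when the receiver learns of an enable event at 13.5,
which is the source of the lookahead the protocol exploits.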

Suppose the processors have globally synchronized, and let $t$ be the minimum time-stamp
among events at all sites. Each site $S_i$ can determine a lower bound $\delta_i(t)$ on the earliest
completion time of any of its pending (i.e., as yet not begun) activities, assuming no further
enable events are received. We call this the site's lookahead bound. For example, consider a
site $S_i$ with queueing. There are three cases to consider.

Case 1: $S_i$'s event list is void of enable events. In this case we define $\delta_i(t) = \infty$.

Case 2: No activity is occurring at $t$, and $S_i$'s event list contains enable events. Let $u$ be
the earliest enable time among these, and define $\delta_i(t)$ to be the completion time of the
activity enabled at $u$.

Case 3: Some activity is occurring at $t$, and $S_i$'s event list contains enable events. Define
$\delta_i(t)$ to be the completion time of the next enabled activity to receive service, assuming
that no further enable events will be inserted into the event list.

If $S_i$ has infinite servers, only two cases arise. If there are no begin events in $S_i$'s event
list, then define $\delta_i(t) = \infty$. If there are begin events in $S_i$'s event list, define $\delta_i(t)$ to be the
minimum completion time among these.

Finally, define

$$\delta(t) = \min_{\text{all sites } S_i} \{\delta_i(t)\}.$$
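
The case analysis can be sketched as follows. This is a hedged illustration, not the authors'
implementation: the function names, the `(time, duration)` tuple layout, and the `pick_next`
argument used to model the queueing discipline are assumptions made for this example.

```python
import math

def infinite_server_bound(pending):
    """pending: (begin_time, duration) pairs for activities not yet begun."""
    return min((b + d for b, d in pending), default=math.inf)

def queueing_bound(t, busy_until, enabled, pick_next=min):
    """Lookahead bound for a single-server, non-preemptive site.

    t          -- current window floor
    busy_until -- completion time of the activity in service, or None if idle
    enabled    -- (enable_time, duration) pairs for activities not yet begun
    pick_next  -- discipline applied to the jobs waiting when the server frees
                  up (min = first-come-first-served, max = last-come-first-served)
    """
    if not enabled:                        # Case 1: nothing is enabled
        return math.inf
    free_at = t if busy_until is None else busy_until
    waiting = [job for job in enabled if job[0] <= free_at]
    if waiting:                            # jobs are waiting when the server frees up
        _, duration = pick_next(waiting)
        return free_at + duration
    enable_time, duration = min(enabled)   # server is free before the earliest enable
    return enable_time + duration

def global_bound(site_bounds):
    """delta(t): the minimum lookahead bound over all sites."""
    return min(site_bounds, default=math.inf)

# Hypothetical numbers, chosen only to exercise each branch.
d1 = infinite_server_bound([(5.0, 2.0), (6.0, 0.5)])
d2 = queueing_bound(5.0, 7.0, [(4.0, 3.0), (6.5, 1.0)])                 # FCFS
d3 = queueing_bound(5.0, 7.0, [(4.0, 3.0), (6.5, 1.0)], pick_next=max)  # LCFS
d4 = queueing_bound(5.0, None, [(9.0, 1.5)])                            # idle server
delta = global_bound([d1, d2, d3, d4])
```

Passing the discipline as a function keeps the sketch independent of any particular
non-preemptive queueing rule.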

The protocol is very simple. Define $w_1 = 0$, and proceed as follows.


1.  Given $w_n$, the processors cooperatively determine $\delta(w_n)$.

2.  Each site may be simulated in parallel with all others until the time of the event with
least time-stamp at that site is as large as $\delta(w_n)$. The processing of any begin event
in this interval must include pre-sending the associated completion messages.

3.  Sites receive the messages sent during the processing of $[w_n, \delta(w_n))$, select duration
times for the associated caused activities, and insert events into their event lists.

4.  $n = n + 1$. Go to step 1.
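
The four steps can be sketched as a small sequential Python program. It is a simplified
illustration under explicit assumptions, not the implementation evaluated in the paper: all
sites have infinite servers, the system is closed, each completion causes exactly one activity
at a uniformly chosen site, only begin events are counted, and the next floor is taken to be
the previous ceiling. All names and parameters are hypothetical.

```python
import heapq
import random

def simulate_windows(num_objects=32, num_sites=4, d_min=0.0, mean_exp=1.0,
                     num_windows=200, seed=0):
    """Return (begin events processed per window, window widths)."""
    rng = random.Random(seed)

    def duration():
        return d_min + rng.expovariate(1.0 / mean_exp)

    # Each pending activity is (begin_time, completion_time, site); its
    # duration was chosen when the message that caused it was received.
    pending = [(0.0, duration(), rng.randrange(num_sites)) for _ in range(num_objects)]
    heapq.heapify(pending)

    floor = 0.0                                   # w_1 = 0
    counts, widths = [], []
    for _ in range(num_windows):
        # Step 1: delta(w_n) is the least completion time among pending activities.
        ceiling = min(completion for _, completion, _ in pending)
        processed, messages = 0, []
        # Step 2: process every begin event with time-stamp below the ceiling.
        while pending and pending[0][0] < ceiling:
            begin, completion, site = heapq.heappop(pending)
            assert completion >= ceiling          # the safety property of Theorem 3.1
            messages.append((completion, rng.randrange(num_sites)))
            processed += 1
        # Step 3: receivers select durations and insert the new events.
        for stamp, dest in messages:
            heapq.heappush(pending, (stamp, stamp + duration(), dest))
        counts.append(processed)
        widths.append(ceiling - floor)
        floor = ceiling                           # Step 4 (assumed): next floor is this ceiling
    return counts, widths

counts, widths = simulate_windows(d_min=0.1)
average_events_per_window = sum(counts) / len(counts)
```

In a parallel setting the loop body of step 2 would run concurrently across sites, with a
global reduction for step 1 and a barrier before step 3.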


The  obvious  question  to  ask  of  this  protocol  is  whether  the  sites  can  safely  process  all 
events  within  a  window.  The  protocol  is  safe  if,  once  the  window  is  established,  no  further 
messages  with  time-stamps  less  than  the  upper  edge  of  the  window  will  ever  be  sent.  The 
following  theorem  establishes  this  fact. 

Theorem 3.1 Let $[w_n, \delta(w_n))$ be a window established by the protocol. Then every
completion message sent during the processing of $[w_n, \delta(w_n))$ has a time-stamp at least as large as
$\delta(w_n)$.


Proof: Completion messages are pre-sent by the processing of begin events. Let $b_0, \ldots, b_k$
be the times of all begin events in $[w_n, \delta(w_n))$, in increasing order. We use induction to
show that for $i = 0, \ldots, k$ the completion messages associated with the begin event at
time $b_i$ have time-stamps at least as large as $\delta(w_n)$. For the base case consider $b_0$, and let
$S_i$ be the associated site. $S_i$ computes $\delta_i(w_n)$ to be the minimum time-stamp on the next
message it sends, provided no further messages are received at $S_i$. By construction $S_i$ will
not receive any further messages with time-stamps less than $b_0$, therefore the decision to
begin the activity at $b_0$ was correctly foreseen during the computation of $\delta_i(w_n)$, implying
that the completion time of the activity beginning at $b_0$ is no smaller than $\delta_i(w_n)$, and hence
is no smaller than $\delta(w_n)$. This establishes the base of the induction. For the induction
step suppose that the completion times of the activities begun at times $b_0, \ldots, b_{j-1}$ are all
no smaller than $\delta(w_n)$. Consider the activity begun at time $b_j$, and let $S_j$ be its site. As
a consequence of the induction hypothesis, during the processing of $[w_n, \delta(w_n))$ $S_j$ cannot
receive any messages with time-stamps less than $b_j$. Consequently, the decision to begin an
activity at time $b_j$ was correctly foreseen during the computation of $\delta_j(w_n)$. The completion
time of the activity beginning at $b_j$ is thus no smaller than $\delta_j(w_n)$, and so is no smaller than
$\delta(w_n)$. This completes the induction.

□ 

Under the assumption of non-zero duration times, it will always be true that $w_n < \delta(w_n)$.
Consequently, simulation time advances each window (even if no events occur in the window),
and deadlock never occurs.

3.3  Example 

An example helps to illustrate the protocol's mechanism. Consider a system with sites $S_1$ and
$S_2$. Site $S_1$ permits an unbounded number of activities to occur simultaneously, while site
$S_2$ imposes queueing. The system moves objects between sites. Duration times are random.
When an object completes its duration it either disappears, moves to another (possibly the
same) site, or splits into a number of objects that move. $S_2$ uses Last-Come-First-Serve
queueing.

Let $w_n = 100$, and imagine that objects $O_1$ and $O_2$ are present at $S_1$, with scheduled
completion times of 100 and 103. Object $O_3$ is in service at $S_2$, and will complete at time
101. Object $O_4$ is enqueued at $S_2$, and will eventually receive 4 units of service.

The completion of $O_1$ at time 100 sends $O_1$ back to $S_1$, where it will receive another 8
units of service; the completion of $O_2$ at time 103 sends $O_2$ to $S_2$ where it will eventually
receive 6 units of service; $O_2$'s completion at time 103 also creates a new object $O_5$ which
is sent to $S_1$, where it receives 4 units of service. At site $S_2$, $O_3$ completes at time 101,
and then remains at $S_2$, where it will receive another 5 units of service. Observe that the
messages reporting the completions of $O_1$, $O_2$, and $O_3$ have already been sent, and the “next”
durations of those objects have already been chosen.

This  scenario  is  summarized  in  figure  1,  along  with  the  contents  of  Sj  and  S/s  event  lists 
as  observed  at  time  100.  The  event  lists  reflect  the  practice  of  pre-sending  object  arrival 
notices.  ,Sj  determines  its  lookahead  bound  ^i(100)  by  finding  the  minimum  completion 
time  among  all  objects  it  knows  will  arrive  at  or  after  time  100.  ()2  arrives  (again)  at  time 
100  and  completes  at  108.  ()h  arrives  at  103  and  completes  at  107.  making  <^(100)  =  107. 
S2  determines  b2(  100)  by  identifying  the  next  object  to  complete  service  that  isn't  already  in 
service.  Because  Sj  is  LCk'S.  t  he  arrival  of  C)A  at  time  101  causes  0:(  to  receive  service  before' 
Of.  A.JI00)  is  1 00,  so  that  <S(100)  =  106.  .Sj  and  S2  are  thus  free  to  simulate  all  events  with 
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Objects  at  time  100 


Object 

Site 

A  rrii'cs 

Duration 

Completes 

Routed 

Comments 

0 1 

Si 

9 

9 

100 

Si 

Occurring  at  time  100 

(h 

Sx 

100 

8 

108 

? 

Caused  by  completion  of  self 

02 

Sx 

9 

9 

103 

S  2 

Occurring  at  time  100 

02 

S2 

10:1 

6 

9 

7 

Caused  by  completion  of  self 

Or, 

Sx 

103 

1 

107 

? 

Caused  by  02  at  .Sj 

o:i 

s2 

? 

9 

101 

s2 

Occurring  at  time  100 

03 

S2 

101 

•r) 

106 

? 

Highest  priority  activity  at  101 

Ox 

S2 

? 

-1 

9 

Lower  priority  at  101 

£(100)  =  106 


Event  Lists 

.Sj  Event  List  at  time  100  Sz  Event  List  at  time  100 


Event 

Time 

Eve  nt 

Time 

Ox  completes 

100 

completes 

101 

0 1  arrives 

100 

03  arrives 

101 

0 2  completes 

103 

02  arrives 

103 

Or,  arrives 

103 

4  events  processed  in  [100.106) 

3  events  processed  in  [100.  106) 

Figure  1:  Example  of  Synchronous  Protocol  Operation 

times no greater than 106, in parallel. $S_1$ has four such events; $S_2$ has three (or four, if the processing of the $O_3$ arrival event at 101 creates a begin event at 101).
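The lookahead computation in this example can be sketched in a few lines. This is an illustrative sketch only: the function and variable names are hypothetical, the data are the values shown in Figure 1, and the queueing rule at $S_2$ is reduced to "the next object to enter service is already known", which is a simplifying assumption.

```python
# Pre-sent arrival notices known to S1 at time 100: (object, arrival, duration).
# S1 imposes no queueing, so every arrival begins service immediately.
S1_ARRIVALS = [("O1", 100, 8), ("O5", 103, 4)]

# S2 is a queueing site. Assumption: O3 (arriving at 101 with duration 5) is
# the next object to enter service that is not already in service.
S2_NEXT_TO_SERVE = ("O3", 101, 5)

S1_EVENTS = [("O1 completes", 100), ("O1 arrives", 100),
             ("O2 completes", 103), ("O5 arrives", 103)]
S2_EVENTS = [("O3 completes", 101), ("O3 arrives", 101), ("O2 arrives", 103)]


def nonqueueing_lookahead(arrivals, now):
    """Minimum completion time among objects arriving at or after `now`."""
    return min(t + d for _, t, d in arrivals if t >= now)


def queueing_lookahead(next_to_serve):
    """Completion time of the next object to enter service."""
    _, begin, duration = next_to_serve
    return begin + duration


def events_in_window(events, lower, upper):
    """Names of the events whose time stamps fall in [lower, upper)."""
    return [name for name, t in events if lower <= t < upper]


delta_1 = nonqueueing_lookahead(S1_ARRIVALS, 100)   # 107
delta_2 = queueing_lookahead(S2_NEXT_TO_SERVE)      # 106
delta = min(delta_1, delta_2)                       # 106

print(delta_1, delta_2, delta)
print(len(events_in_window(S1_EVENTS, 100, delta)),
      len(events_in_window(S2_EVENTS, 100, delta)))
```

The window edge is simply the smallest per-site bound, which is why a single site with poor lookahead limits the whole window.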

Arrival events (enable events) at $S_1$ may also serve as begin events, since no queueing is imposed. Each site's processing of arrival events includes the decision of where to route the object upon completion, and the generation of completion messages with the appropriate time-stamp.

4  Analysis  of  Protocol 

Our performance analysis derives an approximated lower bound on the mean window width, then multiplies by the equilibrium event creation rate in order to bound the average number of events created per window. By flow balance this bounds the average number of events processed per window. We then consider the behavior of this average as a function of simulation activity rate and minimum duration time.

The analysis to follow uses results from the theory of stochastic order relations, and manipulates hazard rate functions. Readers unfamiliar with these tools should consult Ross [25]; the appendix quickly sketches the main ideas and results we use.

We are interested in the limiting value of the expected window width $E[\delta(w_n) - w_n]$ as $n \to \infty$, supposing that the limit exists. As we will see, a window's width is the minimum of a number of complicated random variables. Complications arise both due to randomness in the model (e.g., random selection of sites where activities are caused following a completion), and due to dependence of the random variables' distributions on the past activity in the simulation. Our approach is to bound the mean window width from below with the mean minimum of much simpler, and stochastically smaller, random variables. The stochastically smaller variables are constructed by considering hazard rate functions. This is a useful analytic trick which exploits the fact that the hazard rate function for the minimum of a group of independent random variables is just the sum of their individual hazard rate functions.
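For exponential random variables the hazard rate is constant, so the fact just quoted says that the minimum of independent exponentials with rates $\lambda_1, \ldots, \lambda_m$ is exponential with rate $\lambda_1 + \cdots + \lambda_m$. The following sketch checks this numerically; it is an illustration only, and the rates, sample size, and seed are arbitrary choices rather than values from the analysis.

```python
import random


def mean_min_of_exponentials(rates, samples=100_000, seed=1):
    """Monte Carlo estimate of E[min of independent exponentials]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += min(rng.expovariate(r) for r in rates)
    return total / samples


rates = [1.0, 2.0, 3.0]                 # arbitrary illustrative rates
estimate = mean_min_of_exponentials(rates)
predicted = 1.0 / sum(rates)            # mean of an exponential with the summed rate

print(f"simulated {estimate:.4f}, predicted {predicted:.4f}")
```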

One step in the bounding argument is intuitive, but not rigorously justified. Therefore one can only rigorously call our results approximate.

The analysis uses a slightly more formal model than we have yet described. The duration time distribution for site $S_i$ is taken to be $D_i + \exp\{\mu_i\}$, where $D_i \ge 0$ is constant and $\exp\{\mu_i\}$ is exponential with mean $\mu_i = 1/\lambda_i$. We let $D_{min}$ be the minimum $D_i$ value among all sites. The discussion of random variables, means, and hazard rates all concerns the stochastic portion of the duration times.

Our bounds depend on the manner in which a completing activity causes activities elsewhere. To describe these effects more precisely, for every site $S_i$ let $Reach(S_i)$ be the set of all sites where activities caused by a completion at $S_i$ can occur. For convenience we assume that the activities caused by a single completion are all at different sites. Activity $A_j$ completing at $S_i$ randomly chooses a subset $B_j \subseteq Reach(S_i)$, and causes one activity at each site in $B_j$. We assume $B_j$ is chosen independently of the duration values of the caused activities. The distribution governing this choice is particular to $S_i$; $p(B, i)$ denotes the probability that $B \subseteq Reach(S_i)$ is the selected set.

Let $A_j$ be an activity occurring at site $S_i$, and let $B$ be the set of sites with activities caused by $A_j$. We will be interested in the rate at which the first activity completes, among all those caused by $A_j$. Towards this end, we focus on the stochastic portion of those activity durations. The "rate" of the minimum stochastic portion is just $\lambda_B = \sum_{S_k \in B} \lambda_k$.

The expected rate (with respect to the distribution of $B$) is defined by

\[
\psi_i = \sum_{B \subseteq Reach(S_i)} p(B, i) \Big( \sum_{S_k \in B} \lambda_k \Big)
       = \sum_{S_j \in Reach(S_i)} \Pr\{\text{completion at } S_i \text{ causes an activity at } S_j\}\, \lambda_j. \tag{1}
\]
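The two forms of the expected rate in (1) can be compared on a small example. The sketch below is illustrative only: the three-site reach set, the rates, and the selection distribution are hypothetical values, not taken from any model in this paper.

```python
# Hypothetical reach set {S_1, S_2, S_3} with exponential rates lambda_k.
rates = {1: 0.5, 2: 1.0, 3: 2.0}

# Hypothetical selection distribution p(B, i) over subsets B of the reach set.
p = {
    frozenset({1}): 0.2,
    frozenset({2, 3}): 0.5,
    frozenset({1, 2, 3}): 0.3,
}


def expected_rate_by_subsets(p, rates):
    """First form of (1): average the subset rate lambda_B over the choice of B."""
    return sum(prob * sum(rates[k] for k in subset) for subset, prob in p.items())


def expected_rate_by_sites(p, rates):
    """Second form of (1): weight each site's rate by the chance it is selected."""
    total = 0.0
    for k, lam in rates.items():
        pr_selected = sum(prob for subset, prob in p.items() if k in subset)
        total += pr_selected * lam
    return total


by_subsets = expected_rate_by_subsets(p, rates)
by_sites = expected_rate_by_sites(p, rates)
print(by_subsets, by_sites)
```

Both forms give the same value because the double sum over subsets and their members is merely reordered.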

Pathological analytic difficulties are avoided by assuming that the simulation model always has at least one activity occurring that causes other activities. This can be ensured, for example, by adding a "clock" site that does nothing but process a single, periodically self-causing activity.

Let $\mathcal{A}(w_n)$ be the random set of activities occurring at time $w_n$. During most of the analysis to follow we will condition on knowing that $\mathcal{A}(w_n)$ is some fixed subset $V$. The conditioning is undone later when we take an expectation with respect to the distribution of $\mathcal{A}(w_n)$.

The discussion to follow focuses on activities. To facilitate precise reference to the site where a given activity occurs, we often use the notation $S_{s(j)}$ to describe the site at which activity $A_j$ occurs.

Consider the construction of the $n$th window. To compute $\delta(w_n)$ we examine each site to determine the time of the next message it will send, in the absence of receiving any further messages. For a non-queueing site $S_i$ this is the minimum completion time among all activities with begin events in $S_i$'s event list. For each such activity there is another which caused it, and which is occurring at time $w_n$. If $S_i$ is a queueing site, the activity $A_j$ whose completion defines $S_i$'s lookahead bound is either enqueued waiting for the completion of an occurring activity at $S_i$, or has its enable event sometime in the future. In the latter case we know there must be another site with an activity which is occurring at time $w_n$, and which causes $A_j$. Therefore, every activity whose completion defines some lookahead bound can be associated with an activity occurring at $w_n$. Conversely, for every $A_j \in V$ we can associate a set of sites $C_j$ with activities caused by $A_j$, such that the completion time of each such activity at a site $S_k \in C_j$ equals $\delta_k(w_n)$. $C_j$ is obviously a subset of $B_j$, the set of all sites with activities caused by $A_j$. We see that the minimum completion time among all activities caused by $A_j$ at sites in $B_j$ is no larger than the minimum taken over $C_j$.

We will want to distinguish queueing sites from non-queueing sites. We therefore define the indicator coefficient $\gamma_i$ to have value 1 if site $S_i$ is a queueing site, and to have value 0 if not.

For every $A_j \in V$ let $R_j(w_n)$ denote the residual duration time of $A_j$: the difference between $A_j$'s completion time and $w_n$. For every $A_j \in V$ at a queueing site $S_{s(j)}$ define $X_j(w_n)$ to be $\infty$ if $\delta_{s(j)}(w_n) = \infty$; otherwise it is the duration of the enqueued activity whose completion time is $\delta_{s(j)}(w_n)$. Let $E_j(w_n)$ be the enabling time of the activity defining $X_j(w_n)$. Observe that this activity is sensitive to $w_n$: if $A_j$ was occurring at time $w_{n-1}$ it is possible for a higher priority activity to be enabled between times $w_{n-1}$ and $w_n$, so that the activities defining $X_j(w_{n-1})$ and $X_j(w_n)$ may be different.

We define $X_j(w_n) = \infty$ if $S_{s(j)}$ is a non-queueing site. Regardless of whether $S_{s(j)}$ is a queueing or non-queueing site we may say that the completion rate of $X_j(w_n)$'s stochastic portion is $\gamma_{s(j)} \lambda_{s(j)}$.

Again, let $B_j$ be the set of sites with activities caused by $A_j$. For each $S_k \in B_j$, let $D_k + Y_{j,k}$ be the duration of the activity at $S_k$ caused by $A_j$. We define $A_j$'s lookahead bound to be the minimum completion time among (i) the activities caused by $A_j$, and (ii) the next activity to complete at $S_{s(j)}$ if $S_{s(j)}$ is a queueing site and receives no further enable events. $A_j$'s lookahead bound as measured at time $w_n$ may be written as


K/.ii'r.)  =  d-  /» +  min{  maxjl).  /,’ {tr„)  -  A,(u\,)}  +  -V, min  {/A-  +  T,,*}}}. 

•*A  e 

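$\delta(w_n)$ is the minimum lookahead bound among all activities $A_j \in V$. We may therefore write

\[
\begin{aligned}
E[\delta(w_n) - w_n \mid \mathcal{A}(w_n) = V] &= E\Big[\min_{A_j \in V} \{K_j(w_n)\}\Big] \\
&\ge E\Big[\min_{A_j \in V} \Big\{ R_j(w_n) + \min\Big\{ X_j(w_n),\ \min_{S_k \in B_j} \{D_k + Y_{j,k}\} \Big\} \Big\}\Big].
\end{aligned} \tag{2}
\]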

The expectation above is complicated by its dependence on the history of the synchronization behavior up to time $w_n$. For example, suppose that activity $A_j$ began in the $(n - b_j)$th window, for some $b_j > 0$. The distribution of $K_j(w_n)$ must be conditioned on the event $G_j(w_n)$ that $K_j(w_{n-c}) > w_{n-c+1}$ for all $1 \le c \le b_j$. Since $K_j(w_n)$ is largely comprised of random variables that also comprise $K_j(w_{n-c})$ for each $c$, conditioning on $G_j(w_n)$ makes each $K_j(w_n)$ probabilistically larger than it would be if each component random variable had its original, unconditional distribution. The starting point for our bound is to build a stochastically smaller replacement for each $K_j(w_n)$ by replacing each of $K_j(w_n)$'s components with a pristine unconditional random variable with the appropriate distribution.

We construct an "unconditioned" lookahead variable for each $A_j$ as follows. Randomly choose a subset $\mathcal{U}_{s(j)} \subseteq Reach(S_{s(j)})$ in accordance with the probability distribution $\{p(B, s(j))\}$, and independently choose a duration time $D_k + X_{j,k}$ for each $S_k \in \mathcal{U}_{s(j)}$. $D_k + X_{j,k}$ will replace the actual corresponding duration time $D_k + Y_{j,k}$. Randomly and independently choose some value $D_{s(j)} + W_{j,s(j)}$ from $S_{s(j)}$'s duration time distribution. If $S_{s(j)}$ is a non-queueing site we take $W_{j,s(j)} = \infty$. $D_{s(j)} + W_{j,s(j)}$ will replace the actual $X_j(w_n)$. Let $Z_{j,s(j)}$ be an independent exponential having the distribution of the stochastic portion of $S_{s(j)}$'s duration time. $Z_{j,s(j)}$ will replace $R_j(w_n)$; note that the residual of an $S_{s(j)}$ duration time is always as large as the residual of the duration time's stochastic portion.

The event $G_j(w_n)$ gives us information that $K_j(w_n)$ is probabilistically larger than it would be if its components had their original distributions. Therefore, intuition suggests that the following inequality is true:

\[
\lim_{n \to \infty} E[\delta(w_n) - w_n] \ \ge\ \lim_{n \to \infty} E\Big[\min_{A_j \in \mathcal{A}(w_n)} \Big\{ Z_{j,s(j)} + \min\Big\{ D_{s(j)} + W_{j,s(j)},\ \min_{S_k \in \mathcal{U}_{s(j)}} \{D_k + X_{j,k}\} \Big\} \Big\}\Big]. \tag{3}
\]

Note that the expectations involved in this assumption are not conditioned on $\mathcal{A}(w_n) = V$, and that we only require the inequality to hold in the limit of $n \to \infty$. It seems exceedingly difficult to formally establish this bound. Our analysis therefore proceeds by assuming its validity.

Assumption 4.1 Inequality (3) is true.

We continue the analysis by placing stochastic lower bounds on variables comprising the conditional (on $\mathcal{A}(w_n) = V$) expectation

\[
E\Big[\min_{A_j \in V} \Big\{ Z_{j,s(j)} + \min\Big\{ D_{s(j)} + W_{j,s(j)},\ \min_{S_k \in \mathcal{U}_{s(j)}} \{D_k + X_{j,k}\} \Big\} \Big\}\Big]. \tag{4}
\]

As a first step we note that

\[
\min_{S_k \in \mathcal{U}_{s(j)}} \{D_k + X_{j,k}\} \ \ge\ \min_{S_k \in \mathcal{U}_{s(j)}} \{X_{j,k}\} + D_{min}. \tag{5}
\]

Next we put a stochastic lower bound on $\min\{X_{j,k} \mid S_k \in \mathcal{U}_{s(j)}\}$. This random variable is complicated by the fact that $\mathcal{U}_{s(j)}$ is a random set. For any given set $\mathcal{U}_{s(j)} = B_i$, the minimum of $|B_i|$ exponentials is itself an exponential, with rate $\lambda_{B_i} = \sum_{S_k \in B_i} \lambda_k$. Consequently $\min\{X_{j,k} \mid S_k \in \mathcal{U}_{s(j)}\}$ is a probabilistic mixture of exponentials: with probability $p(B_i, s(j))$ it is an exponential with rate $\lambda_{B_i}$. Without loss of generality we may enumerate all subsets $B_i \subseteq Reach(S_{s(j)})$ in such a way that $\lambda_{B_i} \le \lambda_{B_j}$ whenever $i < j$. Given this ordering, Lemma A.1 establishes that an exponential $T_{j,s(j)}$ whose rate is the "expected" rate $\psi_{s(j)} = \sum_{B_i} p(B_i, s(j)) \lambda_{B_i}$ (see expression (1)) is stochastically smaller than the minimum:

\[
\min_{S_k \in \mathcal{U}_{s(j)}} \{X_{j,k}\} \ \ge_{st}\ T_{j,s(j)}.
\]
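The stochastic ordering just stated can be checked numerically by comparing survival functions: a mixture of exponentials survives past any time $t$ with probability at least that of a single exponential whose rate is the mixture's mean rate (this also follows from the convexity of $e^{-x}$). The sketch below is an illustration with hypothetical mixture weights and rates; it does not reproduce the proof of Lemma A.1.

```python
import math

# Hypothetical mixture: with probability p the minimum is exponential with
# rate lam. These numbers are for illustration only.
mixture = [(0.2, 0.5), (0.5, 3.0), (0.3, 3.5)]   # (probability, rate)

expected_rate = sum(p * lam for p, lam in mixture)


def mixture_survival(t):
    """Pr{mixture of exponentials > t}."""
    return sum(p * math.exp(-lam * t) for p, lam in mixture)


def exponential_survival(t):
    """Pr{exponential with the expected rate > t}."""
    return math.exp(-expected_rate * t)


grid = [0.05 * i for i in range(201)]            # t from 0 to 10
dominates = all(
    mixture_survival(t) >= exponential_survival(t) - 1e-15 for t in grid
)
print(expected_rate, dominates)
```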

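Applying inequalities (5) and (11) we determine that

\[
\begin{aligned}
\min\Big\{ D_{s(j)} + W_{j,s(j)},\ \min_{S_k \in \mathcal{U}_{s(j)}} \{D_k + X_{j,k}\} \Big\}
&\ge \min\Big\{ D_{s(j)} + W_{j,s(j)},\ \min_{S_k \in \mathcal{U}_{s(j)}} \{X_{j,k}\} + D_{min} \Big\} \\
&\ge_{st} \min\{ D_{min} + W_{j,s(j)},\ T_{j,s(j)} + D_{min} \} \\
&= \min\{ W_{j,s(j)},\ T_{j,s(j)} \} + D_{min}.
\end{aligned} \tag{6}
\]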

Since $W_{j,s(j)}$ and $T_{j,s(j)}$ are both exponential, their minimum is also exponential and has rate $\gamma_{s(j)} \lambda_{s(j)} + \psi_{s(j)}$ (recall that $\gamma_{s(j)} = 0$ and $W_{j,s(j)} = \infty$ if $S_{s(j)}$ is a non-queueing site). Let $U_{j,s(j)}$ be an exponential with rate $\gamma_{s(j)} \lambda_{s(j)} + \psi_{s(j)}$. Inequality (6) holds for every $A_j \in V$; furthermore, the lookahead random variable constructed for each $A_j$ is independent of all others. Since the addition and min operators are increasing it follows from (11) that the expectation in (4) is bounded from below:

\[
E\Big[\min_{A_j \in V} \Big\{ Z_{j,s(j)} + \min\Big\{ D_{s(j)} + W_{j,s(j)},\ \min_{S_k \in \mathcal{U}_{s(j)}} \{D_k + X_{j,k}\} \Big\} \Big\}\Big]
\ \ge\ E\Big[\min_{A_j \in V} \{ Z_{j,s(j)} + U_{j,s(j)} \}\Big] + D_{min}. \tag{7}
\]

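We remove the conditioning on $V$ by taking the expectation with respect to $\mathcal{A}(w_n)$:

\[
E\Big[\min_{A_j \in \mathcal{A}(w_n)} \Big\{ Z_{j,s(j)} + \min\Big\{ D_{s(j)} + W_{j,s(j)},\ \min_{S_k \in \mathcal{U}_{s(j)}} \{D_k + X_{j,k}\} \Big\} \Big\}\Big]
\ \ge\ E\Big[\min_{A_j \in \mathcal{A}(w_n)} \{ Z_{j,s(j)} + U_{j,s(j)} \}\Big] + D_{min}.
\]

If $\mathcal{A}$ has the limiting distribution of $\mathcal{A}(w_n)$ as $n \to \infty$ (supposing it exists), then

\[
\lim_{n \to \infty} E\Big[\min_{A_j \in \mathcal{A}(w_n)} \Big\{ Z_{j,s(j)} + \min\Big\{ D_{s(j)} + W_{j,s(j)},\ \min_{S_k \in \mathcal{U}_{s(j)}} \{D_k + X_{j,k}\} \Big\} \Big\}\Big]
\ \ge\ E\Big[\min_{A_j \in \mathcal{A}} \{ Z_{j,s(j)} + U_{j,s(j)} \}\Big] + D_{min}. \tag{8}
\]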

Our next task is to deal with the randomness of the set $\mathcal{A}$.

Let the collection of site servers in the domain be enumerated as $V_1, V_2, \ldots$, and define $v(j)$ to be the index of $V_j$'s site. For each $i = 1, 2, \ldots, S$ and $j = 1, 2, \ldots$ let

\[
\omega_i = \lim_{n \to \infty} E[\text{number of site } S_i \text{ activities occurring at time } w_n],
\]
\[
p_j = \lim_{n \to \infty} \Pr\{\text{at time } w_n \text{ an activity is occurring at } V_j\},
\]

and observe that

\[
\omega_i = \sum_{j:\, v(j) = i} p_j,
\]

assuming that the expectations and limits exist. It is not obvious that $\omega_i$ should be identical to the equilibrium expected number of activities occurring at $S_i$; intuitively one expects it to be close, because the number of windows in which a given activity is found occurring is roughly proportional to the duration of the activity.

The expectation on the right-hand side of (8) is taken with respect to a distribution of random sets of activities found occurring at a window edge. One can equivalently view it as an expectation taken with respect to a random set of servers found busy at a window edge. Inequality (7) suggests we associate two exponentials with each server $V_j$: $Z_j$ and $U_j$ (here binding $j$ to the server rather than to the activity). There is a one-to-one correspondence between a random subset of servers and a random subset $H \subseteq \{(Z_1, U_1), (Z_2, U_2), \ldots\}$. Lemma A.2 was developed to deal with the situation at hand. Following its statement we define

\[
\begin{aligned}
\Lambda &= \sum_{j=1}^{\infty} \Pr\{(Z_j, U_j) \in H\}\, \lambda_{v(j)} \big( \gamma_{v(j)} \lambda_{v(j)} + \psi_{v(j)} \big) \\
&= \sum_{j=1}^{\infty} p_j\, \lambda_{v(j)} \big( \gamma_{v(j)} \lambda_{v(j)} + \psi_{v(j)} \big) \\
&= \sum_{i=1}^{S} \omega_i \lambda_i ( \gamma_i \lambda_i + \psi_i ).
\end{aligned} \tag{9}
\]


The lemma's conclusion is that


E  [  min 
{Z,.U3)GH 


The left-hand side of this inequality is identical to the right-hand side of (8), except for the inclusion of $D_{min}$. Assuming the validity of Assumption 4.1 we may conclude that


lim  E[S(wn)  -  u’n] 

n-^oc- 


(10)

In order to determine the average number of events processed per window we need to consider the rate at which events are generated by the simulation. Let $\bar{\omega}_i$ be the equilibrium mean number of activities occurring at $S_i$. There are two events associated with each activity at a non-queueing site, begin and complete. Adding enable, there are three events associated with an activity at a queueing site. We therefore define the variable $\epsilon_i$ to be 2 or 3 depending on whether $S_i$ is non-queueing or queueing, respectively.

An activity at $S_i$ has mean duration $D_i + \mu_i$, so that the equilibrium event creation rate is $\lambda_{sys} = \sum_{i=1}^{S} \epsilon_i \bar{\omega}_i / (D_i + \mu_i)$. By flow balance this is also the equilibrium event completion rate. We can therefore multiply this rate by the lower bound on the mean window width to bound the mean number of events processed in a window.
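The event creation rate is easy to tabulate for a concrete model. The sketch below is illustrative: the `Site` record and its field names are hypothetical, and the numbers are made up. It also computes the simplest consequence of the multiplication described above, namely the floor obtained from $D_{min}$ alone (a window is at least $D_{min}$ wide, so at least $\lambda_{sys} D_{min}$ events are processed per window on average); the stochastic part of the window-width bound is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Site:
    queueing: bool       # True if the site queues activities
    mean_active: float   # equilibrium mean number of occurring activities
    d: float             # constant part D_i of the duration time
    mu: float            # mean mu_i of the exponential part


def events_per_activity(site):
    """Begin and complete, plus enable at a queueing site."""
    return 3 if site.queueing else 2


def system_event_rate(sites):
    """Sum over sites of events-per-activity * mean activities / mean duration."""
    return sum(
        events_per_activity(s) * s.mean_active / (s.d + s.mu) for s in sites
    )


def events_per_window_floor(sites):
    """Floor from the constant duration component only: rate * D_min."""
    d_min = min(s.d for s in sites)
    return system_event_rate(sites) * d_min


# Hypothetical two-site model.
sites = [
    Site(queueing=False, mean_active=4.0, d=1.0, mu=1.0),
    Site(queueing=True, mean_active=1.0, d=0.5, mu=1.5),
]
rate = system_event_rate(sites)
floor = events_per_window_floor(sites)
print(rate, floor)
```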


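Theorem 4.2 Let

\[
\lambda_{sys} = \sum_{i=1}^{S} \epsilon_i \bar{\omega}_i / (D_i + \mu_i)
\]

be the system event creation rate, and let

\[
\Lambda = \sum_{i=1}^{S} \omega_i \lambda_i ( \gamma_i \lambda_i + \psi_i ).
\]

Then if Assumption 4.1 is valid, the average number of events processed per window is at least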


□ 


This theorem demonstrates how an existing minimal service time accelerates performance. Given constant $D_{min} > 0$, the bound increases at least linearly as the total simulation event rate increases. However, good performance is also possible when $D_{min} = 0$, as we will see.

The value of $\Lambda$ is defined in terms of $\omega_i$. We have no immediate cause for believing that $\omega_i = \bar{\omega}_i$; nor is it clear that the two quantities should be widely different. It seems reasonable then to take $\omega_i \approx \bar{\omega}_i$ as a first approximation. Doing so permits us to analytically estimate $\Lambda$ in some simple cases, and quantify the bound given by Theorem 4.2.

As pointed out by Wagner and Lazowska [30], interconnection topology plays an important role in determining the performance one achieves with a queueing system. Network bottlenecks limit the volume of simulation activity. This is reflected in Theorem 4.2. For example, in a network where each site has one server, $\omega_i$ is approximately the server utilization. A bottleneck site will have a very high utilization while those at other sites are comparatively low. After a point, adding jobs to the network does not appreciably increase the sum of server utilizations, hence the overall event rate does not appreciably increase. For the same reason simulated queueing systems are constrained even if the throughput at each site is equal. The overall system event rate is maximized when all site utilizations are one. After a point, to increase simulation activity one needs to increase the size of the network.

We can approximate the bound in Theorem 4.2 in some simple cases. Consider a model where objects move throughout the domain. An object resides at a site for a fixed time $D_{min}$, plus an exponential time with mean $1/\lambda$, and then moves to any other site, chosen uniformly at random. Equilibrium flow balance equations are easily solved in this situation. Working through the details with $K$ objects and $D_{min} = 0$, one discovers that at least \J~!\ / 2 events are processed per window, on average. A relevant point is that the inter-site communication topology is that of a fully connected graph. Such topologies are generally taken to be extremely taxing on conservative synchronization methods, because the "next" event at a site can come from anywhere. Nevertheless, a significant amount of work is performed each window, at least when $K$ is large. Figure 2 plots the analytically bounded and empirically measured average number of events processed per window, as a function of $\log_2 K$. The empirical measurements represent the sample mean of ten long simulation runs. There was very little variance between these runs. Figure 2 shows that if thousands of objects are in the model, hundreds of events are processed each window. Since parallel processing techniques are used primarily when serial processing times are too slow (or memories are too small), we see that this result applies directly to situations of practical interest: large simulation models on medium-scale parallel architectures.

Performance is greatly enhanced when $D_{min} > 0$. Figure 3 plots measurements of the number of events per window for small models, having only 256 and 1024 objects. The same measurement methodology as was described for Figure 2 is used here. The analytic bound is not displayed, being indistinguishable from the measured performance when plotted on the graph. $D_{min}$ is varied between 0 and $\mu = 1/\lambda$, so that $D_{min}/\mu$ varies between 0 and 1. We see that if a model has minimal duration times we can expect many more events per window than if not. Note that the protocol does not need to know $D_{min}$, as it is already part of the pre-sampled duration times. Dramatic performance improvement as one's ability to "look ahead" increases has also been observed by Fujimoto [6].

Our  confidence  in  the  conclusions  of  Theorem  4.2  is  increased  by  the  fact  that  the  approx¬ 
imated  lower  bound  did  uniformly  fall  below  measured  performance.  Similar  results  have 
been  observed  when  comparing  the  measured  and  bounded  performance  on  less  homogeneous 
simulation  models. 


5  The  Cost  of  Conservative  Synchronization 

Next we consider the overheads involved in implementing this conservative protocol. First we identify conditions under which the average number of events processed per window will grow without bound as the system event creation rate grows without bound. Then we show that as the number of events processed per window grows, our method's per-event overhead due to synchronization, processor idle time, lookahead calculation, and event list manipulation comes within a constant factor of the average per-event overhead of performing the simulation serially.


One way of increasing the system event creation rate $\lambda_{sys}$ is to increase the "size" of the model. For example, we increase the size of the moving objects simulation described earlier by increasing the number of objects in the domain. We may also increase the number of sites, although in this case it is not necessary. Theorem 4.2 shows how the average number of events processed each window may increase as $\lambda_{sys}$ increases. Clearly, if $D_{min} > 0$ then at least $\lambda_{sys} D_{min}$ events are processed each window on average. It is also possible for the average number of events to increase without bound as $\lambda_{sys}$ increases even when $D_{min} = 0$. For example, suppose there is a value $\alpha$ such that, as the size of the simulation model is increased, the following bound is true for all sites $S_i$:


“‘VM  'V  +  ) 


This condition is a formal statement that as the model size grows $\psi_i$ can't get too large relative to $\lambda_i$, and that any difference between $\omega_i$ and $\bar{\omega}_i$ doesn't get too large. The first condition will be satisfied if there exist $\lambda_{max}$ and $R_{max}$ such that as the model size grows, $\lambda_i \le \lambda_{max}$ and $|Reach(S_i)| \le R_{max}$, for all $i$. The second condition ought to be satisfied if our intuition that $\omega_i \approx \bar{\omega}_i$ is correct. If the bound above holds, then as the model size grows the inequality $\Lambda \le \alpha \lambda_{sys}$ will always hold. It follows that at least \J ,5/('2o ) events are processed each window on average, a number that grows without bound as $\lambda_{sys}$ grows without bound.

As a point of comparison, we assume that a serial implementation uses the best known event list management algorithm. If there are $T$ total events in the system on average, we let $f(T)$ be the average complexity of an optimized serial event list algorithm. For example, there is some evidence that a "calendar-queue" implementation has an average $O(1)$ complexity (i.e., $f(T) = 1$) on the hold model [3]. A number of other event list algorithms exhibiting $O(\log T)$ average complexity are also commonly used [8]. We assume that the serial event list algorithm permits the deletion of a non-minimal element without affecting the overall average complexity. This assumption is satisfied by the calendar queue implementation.

We make the reasonable assumption that as the simulation model size is increased, $T$ grows at least as rapidly as $S$, i.e., $T = \Omega(S)$.

Now consider a parallel simulation that uses our protocol. The requirement that completion notices be pre-sent may increase the average total number of events in event lists at a time. This represents a factor of two increase, at most. In the complexity analysis to follow we need not explicitly concern ourselves with this constant factor increase.

One overhead suffered in the parallel simulation is the cost of computing $\delta_i(w_n)$ for each site. This value changes only when an event is inserted or deleted at $S_i$. A queueing site can recompute this value with $O(1)$ cost whenever its event list changes. A non-queueing site can organize the completion times of its pending activities in a priority queue using whatever event list algorithm is employed by the optimized serial implementation. The minimum value in the priority queue defines $\delta_i$. The priority queue is modified only when the site's event list is modified, at cost $O(f(T))$. A processor can organize the $\delta_i$ values from each of its sites into a priority queue, enabling it to determine the minimum on-processor $\delta_i$ value at least as quickly as the optimized serial implementation finds its minimal element. Maintenance of this priority queue costs $O(f(S))$ on average for each processed event. Once each processor has determined its locally minimum $\delta_i$ value, all processors may cooperatively compute the global minimum. Note that our assumption that $P$ is fixed permits us to ascribe a worst-case constant cost to this operation.
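The bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in our testbed: the class and function names are hypothetical, Python's binary heap (`heapq`) stands in for whatever event list algorithm the serial implementation uses, the per-processor priority queue of site bounds is replaced by a plain scan for brevity, and the cooperative global minimum is modeled as a minimum over a list of per-processor values.

```python
import heapq

INF = float("inf")


class NonQueueingSite:
    """Keeps pending completion times; the smallest is the site's bound."""

    def __init__(self):
        self._completions = []

    def add_activity(self, completion_time):
        heapq.heappush(self._completions, completion_time)

    def complete_next(self):
        return heapq.heappop(self._completions)

    def lookahead(self):
        # Peek at the heap minimum; a site with no pending activity
        # constrains nothing.
        return self._completions[0] if self._completions else INF


def processor_minimum(sites):
    """Smallest bound among the sites mapped to one processor."""
    return min((s.lookahead() for s in sites), default=INF)


def global_minimum(local_minima):
    """Stand-in for the cooperative minimum over all processors."""
    return min(local_minima)


# Two hypothetical processors; the values echo the example of Figure 1.
a, b, idle = NonQueueingSite(), NonQueueingSite(), NonQueueingSite()
a.add_activity(108)
a.add_activity(107)
b.add_activity(106)

window_edge = global_minimum([processor_minimum([a, idle]),
                              processor_minimum([b])])
print(window_edge)
```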

Another overhead is processor idle time. The protocol is punctuated with global synchronizations, between which the processors execute in parallel. A processor with little workload will spend a long time waiting for more heavily loaded processors to reach the synchronization barrier. Suppose there are $W$ events to process in a window. For the purposes of analysis, assume that each event may be mapped to any processor, with equal probability¹. Then the number of events assigned to a processor is a binomial $B(W, 1/P)$ random variable.

The collection of workload random variables are not independent, however, as we know they must sum to $W$. However, it isn't difficult to construct a coupling [25] argument to show that the expected maximum workload of this system must be smaller than the expected maximum workload of a system where each processor has an independent $B(W, 1/P)$ workload. The binomial distribution has an increasing hazard rate function [25] (p. 280); it is therefore stochastically less variable than an exponential with the same mean [25] (p. 273), and hence the expected maximum of $P$ independent exponential random variables with mean $W/P$ is at least as large as the expected maximum of $P$ independent $B(W, 1/P)$ random variables. The expected maximum of the exponentials is approximately $(W/P) \ln P$. Assuming each event takes the same amount of time to process, the average fraction of time a processor is left idle is no greater than

\[
1 - \frac{W/P}{(W/P) \ln P} = 1 - \frac{1}{\ln P}.
\]

This implies that the average overhead cost per event due to processor idleness is $O(1)$. This analysis is actually quite pessimistic. Much better load balance can be achieved through use of mapping techniques such as scatter decomposition [22]. Also, the bound above is insensitive to increasing volume of workload, whereas in practice the proportion of idle time tends to decrease as the volume of workload increases.

¹ This can't rigorously be true, since events at the same site are evaluated on the same processor. It is a reasonable approximation when $W$ is large compared with $P$.
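The pessimism of the idle-time bound can be seen in a small experiment. The sketch below is illustrative only: it adopts the same idealization as the analysis (each event mapped to a uniformly random processor, unit cost per event), and the values of `W`, `P`, the number of windows, and the seed are arbitrary choices.

```python
import math
import random


def mean_idle_fraction(W, P, windows=500, seed=7):
    """Average fraction of a window a processor spends idle.

    Each window lasts as long as its busiest processor; on average a
    processor is busy for W/P of that time.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(windows):
        load = [0] * P
        for _ in range(W):
            load[rng.randrange(P)] += 1
        total += 1.0 - (W / P) / max(load)
    return total / windows


W, P = 1000, 32                        # arbitrary illustrative sizes
measured = mean_idle_fraction(W, P)
bound = 1.0 - 1.0 / math.log(P)        # the bound derived above

print(f"measured idle fraction {measured:.3f}, bound {bound:.3f}")
```

Under these assumptions the measured idle fraction sits well below the bound, and it shrinks further as `W` grows relative to `P`.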

The complexity of the average per-event overhead due to event list manipulation, lookahead calculations, processor idle time, and global synchronization is

\[
\frac{W \big( O(f(T)) + O(f(S)) + O(1) \big) + C_{synch}}{W} = O(f(T)).
\]


Relative to the serial simulation, performance must then be within a constant factor of optimal, at least if inter-LP communication costs are ignored (we have already accounted for the communication costs that are specific to the protocol). Inter-LP communication costs are dependent on the simulation model and its mapping, and are dependent on the architecture. It is possible for communication costs to overwhelm performance, even if our protocol finds a great deal of parallel workload. However, these costs are inherent to the model, and would be suffered under any synchronization protocol.

6  Empirical  Results 

We used the protocol analyzed in this paper in a parallel discrete-event simulation testbed
implemented on the Intel iPSC/2 distributed memory multiprocessor^]. The testbed, YAWNS
(Yet Another Windowing Network Simulator) [21], is designed to permit rapid development
of simulation models, by providing a framework within which all synchronization and
interprocessor communication activity is automated, and hidden from the user. YAWNS uses
a computational paradigm where the simulation model is decomposed into communicating
Logical Processes (LPs). LPs interact by passing messages. A site in our analytic model
plays  the  role  of  an  LP. 

The simulation modeler must provide the testbed with three routines for each LP (the
LPs may share these routines). One routine processes messages, typically inserting an event
into the LP's event list as a result. This routine is responsible for choosing a duration time
for  the  enabled  activity.  Another  routine  processes  events.  Messages  to  other  LPs  may  be 
generated  as  a  result  of  calling  this  routine;  these  messages  correspond  to  the  completion 
messages  described  in  the  analytic  model.  The  third  routine  is  called  to  obtain  the  lookahead 
value  required  of  an  LP.  YAWNS  demands  that  the  simulation  modeler  know  about  the 
protocol only to the extent that inter-LP messages are pre-sent, and an LP must be able to
determine  a  lower  bound  on  the  time  of  the  next  message  it  sends. 
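The three routines can be pictured as an abstract interface. This is only a sketch: the paper does not show YAWNS's actual interface, so the class names, method names, and the toy `DelaySite` model below are hypothetical.

```python
import heapq
from abc import ABC, abstractmethod

class LogicalProcess(ABC):
    """Hypothetical LP base class with the three modeler-supplied routines."""

    def __init__(self, name):
        self.name = name
        self.event_list = []  # min-heap of (time, payload)

    @abstractmethod
    def process_message(self, now, message):
        """Handle an incoming message, typically by scheduling an event."""

    @abstractmethod
    def process_event(self, time, payload):
        """Handle an event; return a list of (destination, time, message)."""

    @abstractmethod
    def lookahead(self, now):
        """Return a lower bound on the time of the next message this LP sends."""

class DelaySite(LogicalProcess):
    """Toy LP (an assumption): each arrival is held for a fixed time, then forwarded."""

    def __init__(self, name, service_time, next_site):
        super().__init__(name)
        self.service_time = service_time
        self.next_site = next_site

    def process_message(self, now, message):
        # The duration of the enabled activity is chosen when the message arrives.
        heapq.heappush(self.event_list, (now + self.service_time, message))

    def process_event(self, time, payload):
        return [(self.next_site, time, payload)]

    def lookahead(self, now):
        # Earliest pending completion, or the earliest completion of a future arrival.
        bound = now + self.service_time
        if self.event_list:
            bound = min(bound, self.event_list[0][0])
        return bound
```

A driver would call `process_message` on delivery, `process_event` for each event inside the current window, and `lookahead` when the next window is computed.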

It  is  always  important  to  use  the  best  possible  event  list  algorithm  for  an  LP.  YAWNS 
provides  a  linearly-linked  list  algorithm  for  use  when  the  number  of  events  in  an  LP's  list  is 
small,  and  a  splay-tree  algorithm  for  large  lists. 
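The trade-off between the two event list algorithms can be sketched as follows. This is not the YAWNS code: a sorted Python list stands in for the linearly-linked list, a binary heap (`heapq`) is swapped in for the splay tree, and the size threshold is an arbitrary assumption.

```python
import bisect
import heapq
import itertools

class SortedListEvents:
    """Sorted list: O(n) insertion, cheap when only a few events are pending."""

    def __init__(self):
        self._items = []
        self._seq = itertools.count()  # tie-breaker so payloads are never compared

    def insert(self, time, payload):
        bisect.insort(self._items, (time, next(self._seq), payload))

    def pop_min(self):
        time, _, payload = self._items.pop(0)
        return time, payload

    def __len__(self):
        return len(self._items)

class HeapEvents:
    """Binary heap (stand-in for a splay tree): O(log n) insertion and removal."""

    def __init__(self):
        self._items = []
        self._seq = itertools.count()

    def insert(self, time, payload):
        heapq.heappush(self._items, (time, next(self._seq), payload))

    def pop_min(self):
        time, _, payload = heapq.heappop(self._items)
        return time, payload

    def __len__(self):
        return len(self._items)

def make_event_list(expected_size, threshold=32):
    """Pick a structure from the expected list size (threshold is hypothetical)."""
    return SortedListEvents() if expected_size <= threshold else HeapEvents()
```

Both classes expose the same `insert`/`pop_min` interface, so an LP can switch between them without other changes.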

We  report  on  the  performance  achieved  by  four  diverse  applications:  the  moving  objects 
simulation  described  earlier,  a  logic  network,  the  game  of  Life,  and  a  timed  Petri  net  model. 
All measurements reported are taken from a thirty-two processor machine. Each simulation
model was run long enough to generate several millions of events. The execution time was
typically a minute or two, once the problem was loaded and running. Much longer runs were
also performed, but no appreciable difference in performance statistics was observed.

The measured performance supports our analysis, and actually becomes quite good on
large  problems.  The  metric  we  use  to  gauge  performance  is  average  processor  utilization, 
measured  as  the  fraction  of  time  a  processor  spends  doing  work  that  would  be  performed  in 
a serial implementation of the simulation, using the same paradigm. Time spent in computing
lookahead, synchronization, interprocessor communication, and idle time are explicitly
counted as overhead, and do not appear in the utilization figure. One can translate such
efficiencies  into  "speedup"  figures  by  multiplying  by  the  number  of  processors  used,  provided 
the  resulting  numbers  are  properly  interpreted.  The  speedup  so  computed  is  relative  to  a 
serial version that uses the same paradigm (and code) of communicating LPs as is used in
the parallel version. This is not an unreasonable paradigm for a general purpose serial simulation
system, but is not likely to be the paradigm of choice for a serial version that is highly
optimized for the given application. In our experience (and depending on the application),
the communicating LP paradigm is a factor of 1.5 to 2 times slower than an optimized serial
version.  The  usual  comparison  of  serial  running  time  to  parallel  running  time  is  impossible 
to  directly  obtain,  as  the  largest  models  we  simulate  are  too  large  for  a  single  processor's 
memory.  We  will  see  that  on  the  largest  problems  the  average  processor  utilization  ranges 
from  60%  -  90%. 
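The conversion from utilization to speedup described above is simple arithmetic. The sketch below uses hypothetical function names and assumes the 1.5 to 2 slowdown factor quoted in the text for the communicating-LP paradigm.

```python
def speedup_same_paradigm(utilization, processors):
    """Speedup relative to a serial run of the same communicating-LP code."""
    return utilization * processors

def speedup_vs_optimized_serial(utilization, processors, paradigm_slowdown):
    """Speedup relative to an optimized serial code that is `paradigm_slowdown`
    times faster than the serial communicating-LP version."""
    return utilization * processors / paradigm_slowdown

if __name__ == "__main__":
    for rho in (0.60, 0.90):
        same = speedup_same_paradigm(rho, 32)
        low = speedup_vs_optimized_serial(rho, 32, 2.0)
        high = speedup_vs_optimized_serial(rho, 32, 1.5)
        print(f"utilization={rho:.2f}: {same:.1f} (same paradigm), "
              f"{low:.1f} to {high:.1f} (vs. optimized serial)")
```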

6.1  Moving  Objects 

The sites are connected in a hypercube topology. In each model there are exactly as many
objects  as  there  are  sites.  Each  object  resides  at  a  site  for  a  time  constructed  by  adding 
0.25  to  an  exponential  with  mean  1.  We  increase  the  size  of  the  problem  by  simultaneously 
increasing  the  number  of  objects  and  the  number  of  sites.  We  may  therefore  describe  the 
size of the system by the dimension of the underlying hypercube. Pre-sent completion times
and  lookahead  values  are  computed  exactly  as  described  for  non-queueing  sites  in  this  paper. 
Average processor utilization ρ as a function of hypercube dimension is given below. Many
simulation runs were performed; the variance in the timing numbers is quite small.
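The workload parameters of this experiment can be sketched in a few lines. The helper names are hypothetical; the residence time follows the description above (0.25 plus an exponential with mean 1), and a hypercube of dimension d has 2^d sites whose labels differ from their neighbors' in exactly one bit.

```python
import random

def residence_time(rng, offset=0.25, mean=1.0):
    """Time an object stays at a site: a constant offset plus an exponential."""
    return offset + rng.expovariate(1.0 / mean)

def num_sites(dimension):
    """Number of sites (and of objects) in a hypercube of the given dimension."""
    return 2 ** dimension

def neighbors(site, dimension):
    """Sites adjacent to `site`: flip each of the `dimension` address bits."""
    return [site ^ (1 << bit) for bit in range(dimension)]

if __name__ == "__main__":
    rng = random.Random(0)
    print(num_sites(5), neighbors(0, 5), residence_time(rng))
```

Because every residence time is at least 0.25, each site always has a non-zero minimum activity duration to use when computing lookahead.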


Dim   (values not recoverable)
ρ     21%   28%   34%   46%   54%   60%   62%

6.2  Logic  Network 

To  ensure  that  we  simulated  networks  with  high  concurrency  we  constructed  "random"  logic 
networks  having  the  topology  of  a  butterfly  interconnection  network.  The  last  stage  wraps 
around to feed the first. Each gate was randomly assigned to be an AND, OR, or XOR
function and was given a randomly chosen gate delay time of 1, 2, or 3 time units. Each gate
was modeled as an LP. The eventual output of a gate whose inputs have changed can be
computed at the time the inputs change, hence gate state changes can be pre-sent. A gate is
like a non-queueing site; its lookahead is computed to be the gate delay plus the minimum
time of the next input change. The size of the network can be described by the dimension of a
column of gates. For example, a network of dimension 6 has 6 columns, each composed of
2^6 gates. Observed performance is given below.
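The gate rules can be sketched directly. The function names are hypothetical, and two-input gates are an assumption made only to keep the example small.

```python
import operator

# Gate kinds named in the text, as functions of two 0/1 inputs (two inputs assumed).
GATE_FUNCTIONS = {"AND": operator.and_, "OR": operator.or_, "XOR": operator.xor}

def presend_output(kind, a, b, now, gate_delay):
    """Output change implied by inputs changing at `now`.

    The value is known immediately, so the (time, value) pair can be pre-sent.
    """
    return now + gate_delay, GATE_FUNCTIONS[kind](a, b)

def gate_lookahead(gate_delay, next_input_change_bounds):
    """Gate delay plus the minimum time of the next input change."""
    return gate_delay + min(next_input_change_bounds)

if __name__ == "__main__":
    print(presend_output("XOR", 1, 0, now=10, gate_delay=3))
    print(gate_lookahead(2, [10, 7]))
```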


6.3  Conway’s  Game  of  Life 

Initial random configurations were chosen so that the probability of a cell being alive at
step 0 is 0.2. Each cell is modeled as an LP. A cell is evaluated at step n only if one of
its neighbors (or itself) is alive at step n - 1. It is straightforward to pre-send "new state"
messages; lookahead consists of one step time. The problem size is increased by increasing
the size of the board. Again, we can easily describe problem size in terms of dimension. A
2^j x 2^j board will be said to have dimension j.
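The evaluation rule (a cell is evaluated only if it or a neighbor was alive at the previous step) can be sketched as a sparse update. This is an illustration and not the YAWNS model: it is a plain serial function, and a board with non-wrapping edges is an assumption.

```python
def life_step(alive, size):
    """Advance a size x size Life board one step.

    `alive` is a set of (row, col) pairs. Returns the new set of live cells and
    the number of cells that had to be evaluated.
    """
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
    # Only live cells and their neighbors can change state.
    candidates = set()
    for r, c in alive:
        for dr, dc in offsets:
            rr, cc = r + dr, c + dc
            if 0 <= rr < size and 0 <= cc < size:
                candidates.add((rr, cc))
    new_alive = set()
    for r, c in candidates:
        live_neighbors = sum(
            (r + dr, c + dc) in alive for dr, dc in offsets if (dr, dc) != (0, 0)
        )
        if live_neighbors == 3 or (live_neighbors == 2 and (r, c) in alive):
            new_alive.add((r, c))
    return new_alive, len(candidates)

if __name__ == "__main__":
    blinker = {(3, 2), (3, 3), (3, 4)}
    print(life_step(blinker, 8))
```

On sparse boards the number of evaluated cells is much smaller than the board size, so the amount of event activity depends on the live population.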


Larger problems than a 256x256 board will often exhaust the available dynamic memory
in some processor, after some period of execution. This points out one of the consequences of
internally buffering all messages until the window’s workload is completed.

6.4  Timed  Petri  Nets 

Consider a timed Petri net model of a multiprocessor system organized with a mesh communication
topology. The net models a system where a processor iteratively receives a message
from each of its NEWS (North, East, West, South) neighbors, performs a computation, and
sends a result to each NEWS neighbor. The net models a flow control policy that prevents a
processor from sending a message to a neighbor until the last message it sent to that neighbor
is consumed. An LP consists of the network for one processor, a network containing
approximately  thirty  places  and  ten  transitions.  Nearly  all  transitions  have  a  unit  time  delay 
associated  with  them.  Transitions  modeling  the  processor  execution  time  have  200  units  of 
delay. 

This Petri net model does not satisfy exactly the assumptions we’ve made concerning
simulation model behavior. The main difference is that a token arriving to an LP does not
trigger a single LP activity with a single duration time. The response of the LP is liable
to be much more complex. Nevertheless, the basic synchronization protocol works. Tokens
from an enabled transition are always pre-sent (regardless of whether they are sent to places
within the LP); to compute lookahead, an LP adds the minimum delay among all transitions
that send tokens to other LPs to the least-time token arrival event in the LP's event list.
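The lookahead rule for a Petri net LP amounts to one addition. In the sketch below the function name and arguments are hypothetical, and a non-empty list of pending token arrivals is assumed, since the text does not say what an LP reports when its event list is empty.

```python
def petri_lp_lookahead(outgoing_transition_delays, token_arrival_times):
    """Lookahead for one Petri net LP.

    outgoing_transition_delays: delays of the transitions that send tokens to
        other LPs.
    token_arrival_times: times of the token arrival events in this LP's event
        list (assumed non-empty here).
    """
    if not outgoing_transition_delays or not token_arrival_times:
        raise ValueError("this sketch assumes both inputs are non-empty")
    return min(outgoing_transition_delays) + min(token_arrival_times)

if __name__ == "__main__":
    # Four unit-delay transitions lead to the NEWS neighbors.
    print(petri_lp_lookahead([1, 1, 1, 1], [12.0, 15.5]))
```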

The grid size for the simulated system can be described in terms of dimension in the
same way as was the Game of Life.


Dim   3   4   5   6
ρ     (values illegible in the source)

The comparatively better performance of this problem can be attributed to its better ratio
of computation costs to LP-message costs.


7  Conclusions 

We have analyzed a simple conservative synchronization protocol for parallel discrete-event
simulation. The protocol presumes that one can pre-sample activity duration times (or bound
those times from below), that the immediate effects of simulation model state changes are
very local, and that all queueing disciplines are non-preemptive. The protocol essentially
slides a window across simulation time; the window is defined so that processors can evaluate
all their window events in parallel. We construct an approximated lower bound on the
average number of events processed per window. The bound depends on the topology and
activity rates of the heterogeneous simulation domain. The performance analysis shows that
a great deal of workload can be performed in parallel, if there is a great deal of concurrent
activity in the simulation model. Non-zero minimal activity durations are shown to greatly
improve performance. We show that the asymptotic time complexity of the average total
overhead (synchronization, lookahead calculations, processor idle time, event list manipulation)
per event is that of an optimized serial simulation. Assuming that the complexity
of the communication cost per event is no greater than the overhead of an event in a serial
implementation, the protocol's performance is within a constant factor of optimal. The region
of problems where the method does well is precisely the region where parallel processing
is most effectively applied: problems too large to run serially. The method is verified by
implementation on a distributed memory multiprocessor. Good performance is observed on
a variety of problems.


A  Appendix 

In  this  appendix  we  describe  the  tools  used  in  our  analysis,  and  develop  some  key  results. 

A.1 Stochastic Dominance


Our analysis relies on the theory of stochastic dominance. The definitions and results we
cite are taken from Ross ['A'], chapter 8.

Random variable $X$ is said to be stochastically larger than random variable $Y$ if for all $t$

$$\Pr\{X > t\} \ge \Pr\{Y > t\}.$$

We then write $X \ge_{st} Y$, or $Y \le_{st} X$. An equivalent definition is that

$$E[g(X)] \ge E[g(Y)] \quad \text{for all increasing functions } g.$$

In particular, $E[X] \ge E[Y]$. If $X_1, \ldots, X_n$ are independent random variables and $Y_1, \ldots, Y_n$
are independent random variables such that $X_i \ge_{st} Y_i$ for all $i$, then for all increasing functions
$g$,

$$g(X_1, \ldots, X_n) \ge_{st} g(Y_1, \ldots, Y_n). \qquad (11)$$
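A numerical check makes the definition concrete. The sketch below is an illustration under stated assumptions: it uses exponential random variables (whose survival function is exp(-rate * t)) and tests dominance only on a finite grid of time points. The helper names are hypothetical.

```python
import math

def exp_survival(rate):
    """Survival function Pr{X > t} of an exponential with the given rate."""
    return lambda t: math.exp(-rate * t)

def dominates_on_grid(surv_x, surv_y, grid):
    """True if Pr{X > t} >= Pr{Y > t} at every grid point (a finite check only)."""
    return all(surv_x(t) >= surv_y(t) for t in grid)

if __name__ == "__main__":
    grid = [0.05 * k for k in range(400)]
    # A smaller rate gives a stochastically larger exponential.
    print(dominates_on_grid(exp_survival(1.0), exp_survival(2.0), grid))
    # Instance of (11) with g = min: the minimum of independent exponentials
    # is exponential with the summed rate.
    print(dominates_on_grid(exp_survival(1.0 + 1.5), exp_survival(2.0 + 3.0), grid))
```

The second check pairs rates 1 and 1.5 against rates 2 and 3, so each variable in the first group dominates its counterpart and the minima inherit the ordering.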


A.2 Hazard Rate Functions

If $X$ is a nonnegative continuous random variable, it has a hazard rate function, also known
as a failure rate function. Let $f(t)$ be $X$'s density function, and let $\bar{F}(t) = \Pr\{X > t\}$. Then
$X$'s hazard rate function is defined to be

$$\lambda(t) = \frac{f(t)}{\bar{F}(t)}.$$

If $X$ is exponential, then $\lambda(t)$ is identically the exponential's rate parameter.

We  rely  on  the  following  results  concerning  hazard  rate  functions. 

• If $\lambda_X(t)$ and $\lambda_Y(t)$ are hazard rate functions for $X$ and $Y$, and $\lambda_X(t) \le \lambda_Y(t)$ for all $t$,
then $X \ge_{st} Y$.

• If $X_1, \ldots, X_n$ are independent random variables with hazard rate functions $\lambda_1(t), \ldots, \lambda_n(t)$,
then the hazard rate function for $\min\{X_1, \ldots, X_n\}$ is simply $\sum_{i=1}^{n} \lambda_i(t)$.

• If $X$ has hazard rate function $\lambda(t)$, then for any $t$ and $s$, $s \le t$,

$$\Pr\{X > t \mid X > s\} = \exp\Bigl\{-\int_s^t \lambda(u)\,du\Bigr\}.$$

This also shows (taking $s = 0$) that the hazard rate function uniquely defines a distribution.
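The third result can be exercised numerically. The following sketch (hypothetical helper names, midpoint-rule integration chosen for simplicity) recovers a survival probability from a hazard rate function and checks it against two closed forms: a constant hazard, and a linear hazard c*t, whose survival function is exp(-c*t^2/2).

```python
import math

def survival_from_hazard(hazard, t, s=0.0, steps=10000):
    """Pr{X > t | X > s} = exp(-integral of hazard from s to t), midpoint rule."""
    h = (t - s) / steps
    integral = sum(hazard(s + (k + 0.5) * h) for k in range(steps)) * h
    return math.exp(-integral)

def min_hazard(hazards):
    """Hazard rate of the minimum of independent variables: the sum of hazards."""
    return lambda t: sum(hz(t) for hz in hazards)

if __name__ == "__main__":
    constant = lambda t: 2.0          # exponential with rate 2
    linear = lambda t: 3.0 * t        # hazard c*t with c = 3
    print(survival_from_hazard(constant, 1.5), math.exp(-3.0))
    print(survival_from_hazard(linear, 1.0), math.exp(-1.5))
    # Minimum of exponentials with rates 1 and 2 behaves like rate 3.
    print(survival_from_hazard(min_hazard([lambda t: 1.0, lambda t: 2.0]), 1.0))
```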

A.3 Important Bounds

We now establish some important bounds used in this paper.

Random variables constructed by randomly choosing one of a set of random variables are
called mixtures. The following lemma bounds the hazard rate of a certain class of mixtures.

Lemma A.1 Let $X_1, X_2, \ldots$ be independent random variables with hazard rate functions
$\lambda_1(t), \lambda_2(t), \ldots$, and suppose that these functions are ordered:

$$\lambda_i(t) \le \lambda_{i+1}(t) \quad \text{for all } i = 1, 2, \ldots \text{ and all } t \ge 0.$$

Let $p_1, p_2, \ldots$ be probabilities such that $\sum_{i=1}^{\infty} p_i = 1$, and consider the random variable $M$
constructed by randomly selecting some $X_i$, with probability $p_i$. Let $\lambda_M(t)$ be $M$'s hazard rate
function. Then for all $t \ge 0$

$$\lambda_M(t) \le \sum_{i=1}^{\infty} p_i \lambda_i(t).$$


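Proof:* Let $f_i(t)$ and $F_i(t)$ be the density and cumulative distribution functions for $X_i$, and
write $\bar{F}_i(t) = 1 - F_i(t)$. Then $\lambda_i(t) = f_i(t)/\bar{F}_i(t)$, and

$$\lambda_M(t) = \frac{\sum_{i=1}^{\infty} p_i f_i(t)}{\sum_{i=1}^{\infty} p_i \bar{F}_i(t)}.$$

The desired conclusion will follow if

$$\sum_{i=1}^{\infty} p_i f_i(t) \le \Bigl(\sum_{i=1}^{\infty} p_i \bar{F}_i(t)\Bigr)\Bigl(\sum_{i=1}^{\infty} p_i \lambda_i(t)\Bigr)$$

for all $t$. Let $Y = i$ with probability $p_i$, and let $h(Y, t) = \lambda_Y(t)$ and $g(Y, t) = -\bar{F}_Y(t)$. Then
for every fixed $t$, $h$ and $g$ are increasing in $Y$. Application of Proposition 7.1.5 of [25] (p.
227) yields

$$E[h(Y, t)]\,E[g(Y, t)] \le E[h(Y, t)\,g(Y, t)],$$

or equivalently,

$$\Bigl(\sum_{i=1}^{\infty} p_i \lambda_i(t)\Bigr)\Bigl(\sum_{i=1}^{\infty} p_i \bar{F}_i(t)\Bigr) \ge \sum_{i=1}^{\infty} p_i \lambda_i(t)\bar{F}_i(t) = \sum_{i=1}^{\infty} p_i f_i(t).$$

As this holds for every $t \ge 0$, the lemma’s conclusion follows.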

□ 
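Lemma A.1 can be checked numerically in a simple special case. The sketch below assumes a finite mixture of exponentials, whose hazard rates are constants and therefore trivially ordered; the function names and parameter values are illustrative only.

```python
import math

def mixture_hazard(rates, probs, t):
    """Hazard rate at time t of a mixture of exponentials.

    Computed as (sum of p_i * f_i(t)) / (sum of p_i * Pr{X_i > t}).
    """
    density = sum(p * r * math.exp(-r * t) for r, p in zip(rates, probs))
    survival = sum(p * math.exp(-r * t) for r, p in zip(rates, probs))
    return density / survival

def averaged_hazard(rates, probs):
    """Right-hand side of the lemma: sum of p_i * lambda_i (constant in t here)."""
    return sum(p * r for r, p in zip(rates, probs))

if __name__ == "__main__":
    rates = [0.5, 1.0, 4.0]   # ordered hazard rates
    probs = [0.2, 0.5, 0.3]
    bound = averaged_hazard(rates, probs)
    for t in (0.0, 0.5, 1.0, 5.0):
        print(t, mixture_hazard(rates, probs, t), "<=", bound)
```

The two sides agree at t = 0, and the mixture's hazard rate falls below the bound afterwards because components with high rates are the first to die out.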

We now develop a lower bound on the expected minimum of a random number of variables,
each variable being the sum of two exponentials.


*This elegant proof was suggested by an anonymous referee, replacing a far more complicated proof of


Lemma A.2 Let $S = \{(Z_1, C_1), (Z_2, C_2), \ldots\}$ be a countable set where $Z_i$ is exponential
with rate $\lambda_i$, and $C_i$ is exponential with rate $\nu_i$. Let all these random variables be independent.
Let $B_1, B_2, \ldots$ be the set of finite subsets of $S$. Let $B$ be a random set constructed by choosing
$B_j$ with probability $p_j$. Let

$$\lambda = \sum_{i=1}^{\infty} \Pr\{(Z_i, C_i) \in B\}\,\lambda_i \nu_i.$$

Then

$$E\Bigl[\min_{(Z_i, C_i) \in B} \{Z_i + C_i\}\Bigr] \ge \sqrt{\frac{\pi}{2\lambda}}.$$


Proof: Consider the hazard rate function $\eta_i(t)$ for $Z_i + C_i$. This random variable is the
lifetime of a serial two-stage system where the first stage lasts for time $Z_i$, and the second
lasts for time $C_i$. $\eta_i(t)$ is the instantaneous probability density associated with the system
dying at time $t$, given that it has survived up to time $t$. Now if $Z_i > t$, the system cannot
fail at $t$, whence $\eta_i(t) = 0$. If $Z_i \le t$, then the hazard rate is simply that of $C_i$. Note that
this observation relies on the memoryless property of the exponential. We may therefore
write

$$\begin{aligned}
\eta_i(t) &= (1 - \Pr\{Z_i > t \mid Z_i + C_i > t\})\,\nu_i \\
&\le (1 - \Pr\{Z_i > t\})\,\nu_i \\
&= (1 - \exp\{-t\lambda_i\})\,\nu_i.
\end{aligned}$$

One can show that the left-hand side of this inequality is equivalent to the more usual (and
complicated) derivation of the hazard rate function for the sum of two exponentials [20] (p.
126). The function on the right-hand side is concave in $t$, and is hence dominated everywhere
by the line tangent to it at $t = 0$: $r_i(t) = t\lambda_i\nu_i$. A random variable $Y_i$ with hazard rate
function $r_i(t)$ satisfies $Y_i \le_{st} Z_i + C_i$.

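Let $B_j$ be any finite subset of $S$. By (11) and the observations above we may conclude
that

$$E\Bigl[\min_{(Z_i, C_i) \in B_j} \{Z_i + C_i\}\Bigr] \ge E\Bigl[\min_{(Z_i, C_i) \in B_j} \{Y_i\}\Bigr].$$

We now focus on the right-hand side of this inequality. The hazard rate function for $M_j =
\min\{Y_i \mid (Z_i, C_i) \in B_j\}$ is simply

$$\lambda_{B_j}(t) = t \cdot \Bigl(\sum_{(Z_i, C_i) \in B_j} \lambda_i \nu_i\Bigr).$$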

Without loss of generality we may enumerate the finite subsets of $S$ in such a way that if
$i < j$, then $\lambda_{B_i}(t) \le \lambda_{B_j}(t)$ for all $t$. Let $M$ be a mixture of $\{M_1, M_2, \ldots\}$, where $M_j$ is
chosen with probability $p_j$; let $\lambda_M(t)$ be $M$'s hazard rate function. By Lemma A.1 we can
bound $\lambda_M(t)$ from above by $\lambda_Y(t)$, defined by

$$\lambda_M(t) \le \sum_{j=1}^{\infty} p_j \lambda_{B_j}(t) = \sum_{i=1}^{\infty} \Pr\{(Z_i, C_i) \in B\}\, t \lambda_i \nu_i = \lambda_Y(t).$$

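Let $Y$ be a random variable with hazard rate function $\lambda_Y(t)$. Using the correspondence
between hazard rate functions and probability distributions (see §A.2), we have

$$\begin{aligned}
\Pr\Bigl\{\min_{(Z_i, C_i) \in B} \{Y_i\} > t\Bigr\} &\ge \Pr\{Y > t\} \\
&= \exp\Bigl\{-\int_0^t \lambda_Y(s)\,ds\Bigr\} \\
&= \exp\Bigl\{-\sum_{i=1}^{\infty} \Pr\{(Z_i, C_i) \in B\}\, t^2 \lambda_i \nu_i / 2\Bigr\} \\
&= \exp\{-\lambda t^2/2\}.
\end{aligned}$$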


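Now

$$\begin{aligned}
E\Bigl[\min_{(Z_i, C_i) \in B} \{Z_i + C_i\}\Bigr] &= \int_0^{\infty} \Pr\bigl\{\min\{Z_i + C_i \mid (Z_i, C_i) \in B\} > t\bigr\}\,dt \\
&\ge \int_0^{\infty} \exp\{-\lambda t^2/2\}\,dt \\
&= (1/\sqrt{\lambda}) \int_0^{\infty} \exp\{-s^2/2\}\,ds \quad \text{by defining } s = t\sqrt{\lambda} \\
&= \sqrt{\pi/(2\lambda)}.
\end{aligned}$$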


□ 
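The bound of Lemma A.2 can be compared with exact values in a special case. The sketch below assumes a deterministic set B (chosen with probability 1), so that lambda is the sum of the rate products over the pairs. The final Gaussian integral in the proof evaluates to sqrt(pi / (2 * lambda)). Function names, the integration range, and the step count are illustrative choices.

```python
import math

def sum_survival(lam, nu, t):
    """Pr{Z + C > t} for independent exponentials Z (rate lam) and C (rate nu)."""
    if abs(lam - nu) < 1e-12:
        return (1.0 + lam * t) * math.exp(-lam * t)
    return (nu * math.exp(-lam * t) - lam * math.exp(-nu * t)) / (nu - lam)

def expected_min(pairs, upper=60.0, steps=60000):
    """E[min over pairs of (Z_i + C_i)]: integral of the product of survival
    functions, using the midpoint rule on [0, upper]."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        product = 1.0
        for lam, nu in pairs:
            product *= sum_survival(lam, nu, t)
        total += product
    return total * h

def lemma_bound(pairs):
    """sqrt(pi / (2 * lambda)) with lambda = sum of lam_i * nu_i over the pairs."""
    return math.sqrt(math.pi / (2.0 * sum(lam * nu for lam, nu in pairs)))

if __name__ == "__main__":
    for pairs in ([(1.0, 1.0)], [(1.0, 1.0)] * 4, [(1.0, 2.0), (0.5, 3.0)]):
        print(len(pairs), expected_min(pairs), ">=", lemma_bound(pairs))
```

For four identical pairs with unit rates, the exact expected minimum is 0.8046875 while the bound is about 0.627, so in this case the bound holds with some slack.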